Data Import and Cleaning

Data Import

videogames.df <- read.csv(file.path(project.dir, dataset.dir, 'vgsales-12-4-2019.csv'))
colnames(videogames.df)
##  [1] "Rank"           "Name"           "basename"       "Genre"         
##  [5] "ESRB_Rating"    "Platform"       "Publisher"      "Developer"     
##  [9] "VGChartz_Score" "Critic_Score"   "User_Score"     "Total_Shipped" 
## [13] "Global_Sales"   "NA_Sales"       "PAL_Sales"      "JP_Sales"      
## [17] "Other_Sales"    "Year"           "Last_Update"    "url"           
## [21] "status"         "Vgchartzscore"  "img_url"

Data cleaning

Because the data was collected in April 2019, we exclude games with Year = 2019: the partial year does not give a complete picture of 2019 sales.

videogames.clean <- videogames.df %>% filter(Year < 2019)
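One side effect of the filter above is worth auditing: dplyr's filter() keeps only rows where the condition is TRUE, so rows with a missing Year are dropped along with the 2019 rows. The sketch below uses a small made-up data frame (toy and its values are hypothetical, not taken from the dataset) and base R to count how many rows are lost for each reason.

```r
# Hypothetical toy data standing in for videogames.df
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  Year = c(2017, 2019, NA, 2018)
)

# The comparison yields NA wherever Year is missing
keep <- toy$Year < 2019

# which() discards NA, mirroring how filter() treats an NA condition
kept <- toy[which(keep), ]

# Rows dropped because Year is 2019 or later
n_dropped_2019 <- sum(!keep, na.rm = TRUE)

# Rows dropped only because Year is missing
n_dropped_missing <- sum(is.na(keep))
```

Running the same two counts on videogames.df$Year would show how much of the reduction in rows comes from missing years rather than from the 2019 cutoff.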

Data reshaping

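We want to compare sales across regions, so it is convenient to reshape the data into long format: one Region column naming the region and a corresponding Sales column holding sales in USD (millions). Note that the reshape includes the Global_Sales column, so "Global_Sales" appears as one of the Region values alongside the individual regions.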

vs_byregion <- videogames.clean %>%
  gather(Region, Sales, Global_Sales:Other_Sales, na.rm = TRUE)
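To make the reshape concrete, here is a minimal sketch on a made-up two-game data frame (toy and its numbers are hypothetical). It runs the same gather() call and, for comparison, the equivalent pivot_longer() call, which tidyr now recommends over gather(). Both drop the one missing value, so ten wide cells become nine long rows.

```r
library(tidyr)

# Hypothetical wide data: one row per game, one column per region
toy <- data.frame(
  Name         = c("Game A", "Game B"),
  Global_Sales = c(3.0, 1.0),
  NA_Sales     = c(1.5, 0.5),
  PAL_Sales    = c(1.0, 0.3),
  JP_Sales     = c(0.3, NA),
  Other_Sales  = c(0.2, 0.2)
)

# Same call shape as in the analysis: one row per (game, region) pair
long_gather <- gather(toy, Region, Sales, Global_Sales:Other_Sales, na.rm = TRUE)

# Equivalent reshape with the newer interface
long_pivot <- pivot_longer(
  toy,
  cols = Global_Sales:Other_Sales,
  names_to = "Region",
  values_to = "Sales",
  values_drop_na = TRUE
)
```

The two results contain the same rows but in a different order: gather() stacks the output column by column, while pivot_longer() goes row by row.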

1. Descriptive analysis

Conduct some descriptive analysis on the data, figuring out:
* distributions of variables,
* variables that appear to be strongly related with each other (using appropriate methods to quantify the relationships based on whether variables are numerical or categorical).
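As a rough illustration of matching the method to the variable types, the sketch below pairs each combination with one common base R choice: Spearman correlation for two numeric variables, a Kruskal-Wallis test for a numeric variable against a categorical one, and Fisher's exact test for two categorical variables. The toy data frame and its values are made up, and these particular tests are assumptions for illustration, not necessarily the methods used in this analysis.

```r
# Hypothetical toy data with two numeric and two categorical variables
toy <- data.frame(
  Critic_Score = c(9.1, 8.5, 7.0, 6.2, 5.5, 8.8, 4.9, 7.7),
  Global_Sales = c(5.2, 3.1, 0.9, 0.4, 0.2, 4.0, 0.1, 1.5),
  Genre        = c("Action", "Action", "Sports", "Sports",
                   "Puzzle", "Action", "Puzzle", "Sports"),
  ESRB_Rating  = c("M", "M", "E", "E", "E", "M", "E", "T")
)

# Numeric vs numeric: rank correlation, robust to skewed sales figures
rho <- cor(toy$Critic_Score, toy$Global_Sales,
           method = "spearman", use = "complete.obs")

# Numeric vs categorical: do sales differ across genres?
kw <- kruskal.test(Global_Sales ~ Genre, data = toy)

# Categorical vs categorical: exact test on the contingency table
ft <- fisher.test(table(toy$Genre, toy$ESRB_Rating))
```

With the full dataset a chi-squared test (chisq.test) would be a more typical choice than Fisher's exact test for the categorical pair, since the exact test becomes expensive on large tables.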

All Sales

From the boxplot we can see two extreme outliers. On inspection, both are entries for GTA V (PS3 and PS4).

hist(videogames.clean$Global_Sales, xlab = 'Global Sales (millions of USD)')

hist(videogames.clean$Global_Sales,
     xlab = 'Global Sales (millions of USD)',
     xlim = c(0, .5),
     breaks = 10000)

boxplot(videogames.clean$Global_Sales, ylab = 'Global Sales (millions of USD)')

videogames.clean[which(videogames.clean$Global_Sales > 17), ]
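The cutoff of 17 above was read off the boxplot. As an alternative, here is a small sketch of two ways to pull out extreme values without a hand-picked threshold, using a made-up sales vector (the numbers are hypothetical): boxplot.stats() returns the points beyond the boxplot whiskers, and sorting in decreasing order gives the top values directly.

```r
# Hypothetical sales figures with two obvious extremes
sales <- c(0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 19.4, 20.3)

# Points beyond 1.5 times the hinge spread, i.e. what boxplot() draws as dots
flagged <- boxplot.stats(sales)$out

# The two largest values, regardless of any outlier rule
top_two <- head(sort(sales, decreasing = TRUE), 2)
```

On strongly right-skewed data such as game sales, the whisker rule is likely to flag many more points than the two that stand out visually, so ranking the top few values is often the more practical way to identify them.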

Sales by year

vs_sales.byregion.byyear <- vs_byregion %>%
  group_by(Year, Region) %>%
  summarize(Sales = sum(Sales))

vs_sales.byregion.byyear %>%
  ggplot(aes(x = Year, y = Sales)) +
  geom_line(aes(color = Region))

Critics

fig <- plot_ly(x = videogames.clean$Critic_Score[which(!is.na(videogames.clean$Critic_Score))],
               type = "histogram")

fig